Noise Elimination from the Web Documents by Using URL Paths and Information Redundancy

Authors

  • Byeong Ho Kang
  • Yang Sok Kim
Abstract

on the performance of the Web information management system. Many researchers have proposed noise elimination methods based on document structure. In this paper, we propose a different approach that eliminates redundant information across Web documents from the same URL path. We propose a redundant word/phrase filtering method for single or multiple tokenizations. We conducted two experiments to examine the efficiency and effectiveness of our filtering approaches. Experimental results show that our approach performs well on both criteria.
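The idea of filtering words that are redundant across pages sharing a URL path can be illustrated with a minimal sketch. This is not the authors' algorithm: the grouping key (host plus directory part of the path), the whitespace tokenization, the `min_share` threshold, and the function names `url_path_key` and `filter_redundant_words` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from posixpath import dirname
from urllib.parse import urlparse


def url_path_key(url):
    """Group key: host plus the directory part of the URL path (an assumption)."""
    parts = urlparse(url)
    return parts.netloc + dirname(parts.path)


def filter_redundant_words(pages, min_share=0.8):
    """Drop words that occur in at least `min_share` of the pages sharing a URL path.

    `pages` maps URL -> plain text. Whitespace tokenization and the 0.8
    threshold are illustrative choices, not taken from the paper.
    """
    groups = defaultdict(list)
    for url in pages:
        groups[url_path_key(url)].append(url)

    cleaned = {}
    for urls in groups.values():
        if len(urls) < 2:
            # A single page has nothing to be compared against; keep it as is.
            for url in urls:
                cleaned[url] = pages[url]
            continue
        tokens = {url: pages[url].split() for url in urls}
        # Document frequency: in how many pages of this group each word occurs.
        doc_freq = Counter(w for words in tokens.values() for w in set(words))
        for url in urls:
            kept = [w for w in tokens[url] if doc_freq[w] / len(urls) < min_share]
            cleaned[url] = " ".join(kept)
    return cleaned
```

Given three pages under the same hypothetical path `http://example.com/news/` that all start with the navigation words "Home Contact", the sketch removes those two words from each page and keeps the words unique to each page. A phrase-level variant would count n-grams instead of single words.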

Similar resources

Impulsive Noise Elimination Considering the Bit Planes Information of the Image

Impulsive noise is one of the defects that degrade the quality of images. The performance of many image processing applications depends directly on the quality of the input image. Hence, it is necessary to de-noise degraded images without losing valuable information such as edges. In this paper we propose a method to remove impulsive noise from color images without damaging the ima...

Semantic web access prediction using WordNet

The user-observed latency of retrieving Web documents is one of the limiting factors when using the Internet as an information source. Prefetching has become an important technique for reducing the average Web access latency. Existing prefetching methods are based predominantly on URL graphs. They use the graphical nature of HTTP links to determine the possible paths through a hypertext system. Althoug...

Webclass: Web Document Classification Using Modified Decision Trees

Searching for Web sites is one of the most common tasks performed on the Web. Web page classification is the first step for Web search service construction. This paper proposes a system, named WebClass, for classifying Web documents by using a height-three modified decision tree which splits the root, depth-one nodes, and depth-two nodes on the keywords, descriptions, and hyperlinks, respectively. ...

A Research on Web Content Extraction and Noise Reduction through Text Density Using Malicious URL Pattern Detection

A Web page carries a large amount of information, including additional content such as hyperlinks, headers, footers, navigation panels, and advertisements, which can complicate content extraction. Page segmentation is used to detect noisy content blocks by detecting malicious URLs in Web pages. The main aim of this research is detecting malicious URLs during content extraction by checking ...

Noise-tolerance feasibility for restricted-domain Information Retrieval systems

Information Retrieval systems normally have to work with rather heterogeneous sources, such as Web sites or documents from Optical Character Recognition tools. The correct conversion of these sources into flat text files is not a trivial task since noise may easily be introduced as a result of spelling or typeset errors. Interestingly, this is not a great drawback when the size of the corpus is...

Publication date: 2006